import glob
import pandas as pd
import numpy as np
import xlrd
import warnings
'ignore') warnings.filterwarnings(
Alberta In-situ Oilsands Production Analysis using Python - Part II
Preparing a tidy dataset for data analysis and visualization
Disclaimer
This blog post is for educational purposes only. Any commercial use of the information provided in this blog post is prohibited. For more information about the AER copyright and permission to reproduce, please visit AER Copyright and Disclaimer. The author is not responsible for any damage or loss caused by the use of the information provided in this blog post.
Introduction
This blog post focuses on analyzing the bitumen production data from in-situ oilsands projects in Alberta. Alberta Energy Regulator (AER) lists the monthly injection and production data for all in-situ oilsands projects in Alberta on its website Alberta Energy Regulator. The data is available in several spreadsheets and can be downloaded from ST53: Alberta In Situ Oil Sands Production Summary. The annual data are reported in separate files in .xls format. Each file has 6 different sheets with the following information: bitumen production, water usage, steam injection, well count, steam-to-oil ratio (SOR), and water-to-oil ratio (WSR).
In the previous post Alberta In-situ Oilsands Production Analysis using Python - Part I, I have shown how to download the data using python. In this post we create a tidy bitumen dataset for data analysis and visualization.
Loading the Data
All files are stored in the current local directory. We use the glob
module to list all .xls
files in the directory and then use pandas
to read the data into a DataFrame
. The xlrd
is used as an engine in pandas to read the .xls
files. The numpy
module is used to perform mathematical operations on the data. The warnings
module is used to suppress the warnings.
The glob
module is used in this code to create a list of file paths for all the .xls files in the current directory.
= glob.glob('*.xls')
xls_files xls_files
['ST53_2010-12.xls',
'ST53_2011-12.xls',
'ST53_2012-12.xls',
'ST53_2013-12.xls',
'ST53_2014-12.xls',
'ST53_2016-12.xls',
'ST53_2017-12.xls',
'ST53_2018-12.xls',
'ST53_2019-12.xls',
'ST53_2020-12.xls',
'ST53_2021-12.xls',
'ST53_2022-12.xls']
Analyzing the 2010 bitumen production data
This section describes the steps for creating a tidy dataset for bitumen production using the 2010 data.
Reading the first file
To start, we read the data from the ST53_2010.xls
file using the read_excel
function from the pandas library. We set the sheet_name
argument to None
to read all the sheets in the file. We also use the skiprows
argument to skip the first three rows of each sheet. Finally, we use the keys()
method to print the names of the sheets in the file.
= pd.read_excel(xls_files[0], sheet_name = None, skiprows = 3)
sheets print(f'sheet names: {sheets.keys()}')
sheet names: dict_keys(['BITUMEN', 'WATER', 'STEAM', 'WELLS', 'SOR', 'WSR'])
Reading the BITUMEN sheet
# read the BITUMEN sheet
= sheets['BITUMEN'] bitumen_2010
The info
method is used to print the column names and data types of each column in the DataFrame.
bitumen_2010.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Operator 172 non-null object
1 Scheme Name 26 non-null object
2 Area 163 non-null object
3 Approval Number 163 non-null object
4 Recovery Method 164 non-null object
5 Jan 166 non-null float64
6 Feb 166 non-null float64
7 Mar 166 non-null float64
8 Apr 165 non-null float64
9 May 165 non-null float64
10 Jun 162 non-null float64
11 Jul 162 non-null float64
12 Aug 162 non-null float64
13 Sep 161 non-null float64
14 Oct 159 non-null float64
15 Nov 159 non-null float64
16 Dec 156 non-null float64
17 Monthly Average 153 non-null float64
dtypes: float64(13), object(5)
memory usage: 25.0+ KB
The head
method is used to display the first 5 rows of the DataFrame
.
bitumen_2010.head()
Operator | Scheme Name | Area | Approval Number | Recovery Method | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Monthly Average | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Canadian Natural Resources Limited | NaN | Cold Lake | 4746P | Commercial | 587.48 | 596.72 | 637.59 | 678.74 | 705.61 | 718.99 | 697.25 | 696.88 | 763.21 | 702.21 | 696.32 | 674.99 | 679.67 |
1 | Canadian Natural Resources Limited | NaN | Cold Lake | 6726I | Commercial | 171.54 | 193.66 | 187.71 | 177.54 | 176.90 | 174.50 | 160.53 | 155.97 | 168.22 | 168.27 | 139.73 | 143.48 | 168.17 |
2 | Baytex Energy Ltd. (¹) | Cliffdale Pilot | Peace River Area 2 | 11034E | Commercial-CSS | 0.00 | 2.55 | 6.15 | 2.25 | 4.16 | 10.59 | 5.83 | 3.27 | 9.18 | 20.09 | 15.42 | 13.15 | 8.42 |
3 | North Peace Energy Corp. | Red Earth | Peace River Area 2 | 11209A | Commercial-CSS | 10.30 | 4.70 | 3.11 | 2.21 | 1.63 | 0.51 | 0.00 | 0.00 | 0.40 | 0.00 | 0.00 | 0.00 | 3.27 |
4 | Penn West Petroleum Ltd. | Seal | Peace River Area 2 | 11377A | Commercial-CSS | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN |
The tail
method is used to print the last 20 rows of the DataFrame
.
20) bitumen_2010.tail(
Operator | Scheme Name | Area | Approval Number | Recovery Method | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Monthly Average | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
157 | Bronco Energy Ltd. | NaN | Athabasca | 9774E | Primary | 28.65 | 28.58 | 27.94 | 22.84 | 91.39 | 80.23 | 77.59 | 83.21 | 64.28 | 74.45 | 69.66 | 63.98 | 59.40 |
158 | Devon Canada Corporation | NaN | Cold Lake | 9821 | Primary | 4.00 | 3.41 | 3.42 | 0.00 | 1.87 | 0.00 | 1.74 | 1.04 | 4.18 | 4.09 | 1.94 | 2.76 | 2.85 |
159 | Frog Lake Energy Resources Corp. | NaN | Cold Lake | 9827B | Primary | 11.42 | 5.22 | 5.11 | 8.76 | 12.08 | 16.21 | 13.78 | 11.12 | 17.36 | 18.75 | 19.41 | 22.18 | 13.45 |
160 | Husky Oil Operations Limited | NaN | Cold Lake | 9873 | Primary | 12.32 | 12.80 | 12.66 | 11.27 | 12.93 | 11.56 | 11.63 | 11.27 | 10.62 | 10.79 | 11.50 | 10.50 | 11.65 |
161 | Canadian Natural Resources Limited | NaN | Athabasca | 9884D | Primary | 65.64 | 68.55 | 69.79 | 64.72 | 68.09 | 64.18 | 63.37 | 66.10 | 64.25 | 62.55 | 59.87 | 55.29 | 64.37 |
162 | Koch Exploration Canada G/P Ltd. (¹) | NaN | Cold Lake | 9908D | Primary | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | NaN |
163 | NaN | NaN | NaN | NaN | Total | 109423.83 | 116311.68 | 120054.09 | 116601.43 | 121671.30 | 122439.61 | 122234.75 | 121215.19 | 108260.73 | 118808.23 | 128196.96 | 129913.57 | 119594.28 |
164 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
165 | Conventional Bitumen Recovery | NaN | NaN | NaN | NaN | 537.49 | 544.67 | 560.36 | 541.87 | 560.71 | 524.96 | 482.84 | 555.27 | 555.21 | 545.33 | 601.13 | 593.86 | 550.31 |
166 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
167 | Total | NaN | NaN | NaN | NaN | 109961.32 | 116856.35 | 120614.45 | 117143.30 | 122232.01 | 122964.57 | 122717.59 | 121770.46 | 108815.94 | 119353.56 | 128798.09 | 130507.43 | 120144.59 |
168 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
169 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
170 | Notes: | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
171 | - Monthly production totals may be revised due... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
172 | - Total may not add due to rounding | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
173 | - Conventional Bitumen production consists of ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
174 | - Monthly average does not include months ther... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
175 | ¹ Amended this month | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
176 | ¹ Amended this month | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Since the number of producers in the file can vary, we cannot rely on a fixed index number to subset the dataset. However, the last row of the dataset always contains a summary of the total production, which is identified by the cell value of Total in the Recovery Method
column. Therefore, we can use the index number of the row with Total in the Recovery Method column to subset the dataset and exclude the summary row and any other rows with additional information.
# get the index of the last row using np.where
= np.where(bitumen_2010[["Recovery Method"]] == "Total ")[0][0]
last_row_idx = bitumen_2010.iloc[:last_row_idx, :] bitumen_2010
In addition, we can remove the Monthly Average
column from the dataset since we will be aggregating the data in our analysis and therefore do not need the monthly averages.
= bitumen_2010.drop(columns = ['Monthly Average']) bitumen_2010
Cleaning the Operator column
Let’s take a look at the Operator
column.
'Operator'].unique() bitumen_2010[
array(['Canadian Natural Resources Limited', 'Baytex Energy Ltd. (¹)',
'North Peace Energy Corp.', 'Penn West Petroleum Ltd.',
'Baytex Energy Ltd. (¹²)', 'Shell Canada Limited',
'Imperial Oil Resources (¹)',
'Canadian Natural Resources Limited (¹)',
'Devon NEC Corporation (¹)', 'Connacher Oil And Gas Limited',
'Meg Energy Corp.', 'Statoil Canada Ltd. (¹)',
'Cenovus Energy Inc.', 'Cenovus FCCL Ltd.', 'Suncor Energy Inc.',
'Japan Canada Oil Sands Limited (¹)', 'Suncor Energy Inc. (¹)',
'Total E&P Joslyn Ltd. (¹)',
'ConocoPhillips Canada Resources Corp.',
'ConocoPhillips Canada Limited', 'Nexen Inc.',
'Husky Oil Operations Limited', 'Canadian Natural Resources',
'Murphy Oil Company Ltd.', 'Cenovus Energy Inc. (¹)',
'Prosper Petroleum Ltd.', 'Penn West Energy Trust',
'Baytex Energy Ltd.', 'Shell Canada Energy',
'Husky Oil Operations Ltd.', 'Action Energy Inc.',
'Devon Canada Corporation', 'Bonavista Petroleum Ltd.',
'Murphy Oil Canada', 'Watch Resources Ltd.',
'Manitok Exploration Inc.', 'Koch Exploration Canada G/P Ltd.',
'Husky Oil Operations Limited (et Al)', 'Twin Butte Energy Ltd.',
'Sunshine Oilsands Ltd.', 'Solara Exploration Ltd.',
'Canadian Natural Resources Limited (¹²)',
'Bellatrix Exploration Ltd. (¹²)', 'Blackpearl Resources Inc. (¹)',
'Crispin Energy Inc.', 'Husky Oil Operations Limited (¹)',
'Buffalo Resources Corp.', 'ExxonMobil Canada Ltd.',
'Harvest Operations Corp.', 'Bronco Energy Ltd.',
'Frog Lake Energy Resources Corp.',
'Koch Exploration Canada G/P Ltd. (¹)'], dtype=object)
The Operator column contains several names that are followed by (subscripts)
. These additional pieces of information create inconsistencies in the Operator
column and add unnecessary complexity to the dataset. Since they are not needed for our analysis, we can remove them using the replace
method.
'Operator'] = bitumen_2010['Operator'].str.replace(r'\(.*\)', '').str.strip()
bitumen_2010['Operator'].unique() bitumen_2010[
array(['Canadian Natural Resources Limited', 'Baytex Energy Ltd.',
'North Peace Energy Corp.', 'Penn West Petroleum Ltd.',
'Shell Canada Limited', 'Imperial Oil Resources',
'Devon NEC Corporation', 'Connacher Oil And Gas Limited',
'Meg Energy Corp.', 'Statoil Canada Ltd.', 'Cenovus Energy Inc.',
'Cenovus FCCL Ltd.', 'Suncor Energy Inc.',
'Japan Canada Oil Sands Limited', 'Total E&P Joslyn Ltd.',
'ConocoPhillips Canada Resources Corp.',
'ConocoPhillips Canada Limited', 'Nexen Inc.',
'Husky Oil Operations Limited', 'Canadian Natural Resources',
'Murphy Oil Company Ltd.', 'Prosper Petroleum Ltd.',
'Penn West Energy Trust', 'Shell Canada Energy',
'Husky Oil Operations Ltd.', 'Action Energy Inc.',
'Devon Canada Corporation', 'Bonavista Petroleum Ltd.',
'Murphy Oil Canada', 'Watch Resources Ltd.',
'Manitok Exploration Inc.', 'Koch Exploration Canada G/P Ltd.',
'Twin Butte Energy Ltd.', 'Sunshine Oilsands Ltd.',
'Solara Exploration Ltd.', 'Bellatrix Exploration Ltd.',
'Blackpearl Resources Inc.', 'Crispin Energy Inc.',
'Buffalo Resources Corp.', 'ExxonMobil Canada Ltd.',
'Harvest Operations Corp.', 'Bronco Energy Ltd.',
'Frog Lake Energy Resources Corp.'], dtype=object)
The operator names in the dataset may contain inconsistencies where some operators are listed with different names. To make the names consistent, we can create a dictionary with the names to be replaced and the new names. However, it’s important to note that the following dictionary may contain additional names that are not in the 2010 dataset. This dictionary was created using data from all the files between 2010 and 2022.
# create a dictionary to make Operator names consistent
= {'Athabasca Oil Corporation': 'Athabasca Oil',
operators_dict 'Baytex Energy Ltd.': 'Baytex',
'Blackpearl Resources Inc.': 'BlackPearl',
'BlackPearl Resources Inc.': 'BlackPearl',
'Bonavista Petroleum Ltd.': 'Bonavista',
'Bonavista Energy Corporation': 'Bonavista',
'CNOOC Petroleum North America ULC': 'CNOOC',
'Canadian Natural Resources': 'CNRL',
'Canadian Natural Resources Limited': 'CNRL',
'Canadian Natural Upgrading Limited': 'CNRL',
'Cenovus Energy Inc.': 'Cenovus',
'Cenovus FCCL Ltd.': 'Cenovus',
'ConocoPhillips Canada Resources Corp.': 'ConocoPhillips',
'ConocoPhillips Canada Limited': 'ConocoPhillips',
'Devon Canada Corporation': 'Devon',
'Devon NEC Corporation': 'Devon',
'ExxonMobil Canada Ltd.': 'Imperial',
'Greenfire Hangingstone Operating Corporation': 'Greenfire',
'Greenfire Resources Operating Corporation': 'Greenfire',
'Husky Oil Operations Limited' : 'Husky',
'Husky Oil Operations Ltd.' : 'Husky',
'Imperial Oil Resources': 'Imperial',
'Imperial Oil Resources Limited': 'Imperial',
'Islander Oil & Gas Inc.': 'Islander',
'Koch Exploration Canada G/P Ltd.': 'Koch',
'Koch Oil Sands Operating ULC': 'Koch',
'MEG Energy Corp.': 'MEG',
'Meg Energy Corp.': 'MEG',
'Murphy Oil Canada': 'Murphy',
'Murphy Oil Company Ltd.': 'Murphy',
'Nexen Energy ULC': 'Nexen',
'Nexen Inc.': 'Nexen',
'Obsidian Energy Ltd.': 'Obsidian',
'OSUM Oil Sands Corp.': 'OSUM',
'Osum Production Corp.': 'OSUM',
'Pengrowth Corporation': 'Pengrowth',
'Pengrowth Energy Corporation': 'Pengrowth',
'Penn West Energy Trust': 'Penn West',
'Penn West Petroleum Ltd.': 'Penn West',
'Perpetual Energy Inc.': 'Perpetual',
'Perpetual Energy Operating Corp.': 'Perpetual',
'Perpetual Operating Corp.': 'Perpetual',
'PetroChina Canada Ltd.': 'PetroChina',
'Petrochina Canada Ltd.': 'PetroChina',
'Strathcona Resources Ltd.': 'Strathcona',
'Shell Canada Energy': 'Shell',
'Shell Canada Limited': 'Shell',
'Spur Petroleum Ltd.': 'Spur',
'Spur Resources Ltd.': 'Spur',
'Suncor Energy Inc.': 'Suncor',
'Woodcote Oil & Gas Inc.': 'Woodcote',
'Woodcote Oil Corp.': 'Woodcote'
}
# replace the names in the Operator column
'Operator'] = bitumen_2010['Operator'].replace(operators_dict)
bitumen_2010['Operator'].unique() bitumen_2010[
array(['CNRL', 'Baytex', 'North Peace Energy Corp.', 'Penn West', 'Shell',
'Imperial', 'Devon', 'Connacher Oil And Gas Limited', 'MEG',
'Statoil Canada Ltd.', 'Cenovus', 'Suncor',
'Japan Canada Oil Sands Limited', 'Total E&P Joslyn Ltd.',
'ConocoPhillips', 'Nexen', 'Husky', 'Murphy',
'Prosper Petroleum Ltd.', 'Action Energy Inc.', 'Bonavista',
'Watch Resources Ltd.', 'Manitok Exploration Inc.', 'Koch',
'Twin Butte Energy Ltd.', 'Sunshine Oilsands Ltd.',
'Solara Exploration Ltd.', 'Bellatrix Exploration Ltd.',
'BlackPearl', 'Crispin Energy Inc.', 'Buffalo Resources Corp.',
'Harvest Operations Corp.', 'Bronco Energy Ltd.',
'Frog Lake Energy Resources Corp.'], dtype=object)
It should be noted that there have been a number of acquisitions and mergers in the oil and gas industry in recent years, and this can lead to inconsistencies in the naming of operators across different years. However, for the sake of simplicity, we will ignore these changes and assume that operator names remain consistent across all years.
Simplifiying the Area column
To simplify the Area
column, we can replace the values Peace River Area 1
and Peace River Area 2
with a single value Peace River
. This is done to aggregate the production data for the Peace River area.
'Area'] = bitumen_2010['Area'].replace({'Peace River Area 1': 'Peace River', 'Peace River Area 2': 'Peace River'})
bitumen_2010['Area'].unique() bitumen_2010[
array(['Cold Lake', 'Peace River', 'Athabasca, Cold Lake', 'Athabasca'],
dtype=object)
There is only one row with an Area
value of Athabasca, Cold Lake
. To make it consistent with the other rows, we can change it to Cold Lake
.
'Area'] = bitumen_2010['Area'].replace({'Athabasca, Cold Lake': 'Cold Lake'}) bitumen_2010[
We also add a Year
column to the dataframe.
'Year'] = int(xls_files[0].split('_')[-1].split('-')[0])
bitumen_2010['Year'].unique() bitumen_2010[
array([2010], dtype=int64)
Creating a single tidy dataset for bitumen production for all years
We can combine all the previous steps to create a function that generates a tidy dataset for the bitumen analysis. The function takes the following arguments:
df
: a dataframe for a given yearoperators_dict
: a dictionary with the names to be replaced and the new namesxls_file
: the name of the excel file for a given year
# define a function to create a tidy dataset for bitumen production
def create_tidy_bitumen(df, operators_dict, xls_file):
= np.where(df[["Recovery Method"]] == "Total ")[0][0]
last_row_idx = df.iloc[:last_row_idx, :]
df = (df
df = ['Monthly Average'])
.drop(columns = ['Operator'])
.dropna(subset = lambda x: x.Operator.str.split('(').str[0].str.strip())
.assign(Operator = lambda x: x.Operator.str.replace(' ', ' '))
.assign(Operator 'Operator': operators_dict})
.replace({= int(xls_file.split('_')[-1].split('-')[0]))
.assign(Year = lambda x: x.Area.replace({'Peace River Area 1': 'Peace River', 'Peace River Area 2': 'Peace River'}))
.assign(Area
)return df
To generate a tidy dataset for bitumen production for all years, we can use the previously defined function to create tidy datasets for each year and then use a for loop to iterate through all the .xls files. For each year, we generate a tidy dataset using the function and append it to the bitumen
dataframe using the df.append()
method.
# create a tidy dataset for bitumen production for all years
= pd.DataFrame()
bitumen for xls_file in xls_files:
= pd.read_excel(xls_file, sheet_name = None, skiprows = 3)
df = bitumen.append(create_tidy_bitumen(df['BITUMEN'], operators_dict, xls_file)) bitumen
Let’s check the Year
column to make sure that we have data for all years.
'Year'].unique() bitumen[
array([2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018, 2019, 2020, 2021,
2022], dtype=int64)
Saving the dataset
We save the dataset as a csv file in the current local folder for future use.
'bitumen.csv', index = False) bitumen.to_csv(